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Abstract 

This paper is about the estimation of fixed model parameters in hidden Markov models us- 
ing an online (or recursive) version of the Expectation-Maximization (EM) algorithm. It is first 
shown that under suitable mixing assumptions, the large sample behavior of the traditional 
(batch) EM algorithm may be analyzed through the notion of a limiting EM recursion, which 
is deterministic. This observation generalizes results previously obtained for latent data model 
with independent observations. By using the recursive implementation of smoothing compu- 
tations associated with sum functionals of the hidden state, it is then possible to propose an 
online EM algorithm that generalizes an approach recently proposed in the case of HMMs with 
finite- valued observations. The performance of the proposed algorithm is numerically evaluated 
through simulations in the case of a noisily observed Markov chain. 

Keywords Hidden Markov Models, Expectation Maximization Algorithm, Online Estima- 
tion, Recursive Estimation, Stochastic Approximation 

1 Introduction 

Hidden Markov modelling certainly constitutes one the contributions of statistical time series anal- 
ysis which has had the most profound practical impact in the latest forty years. Hidden Markov 
models (HMMs) in their simplest form (i.e. when the state variable is finite) are sufficiently simple 
to give rise to efficient inference procedures while allowin g for useful model l ing of a wide range 
of situ ations. Ever since the pioneering contributions of iBaum and Eago 3 (|l967l ). iBaum et al " 



( 197d ). the EM (Expectation-Maximization) algorithm has been the method of choice for param- 



eter inference in HMMs. The EM algorithm is a dedicated numerical optimization routine which 
aims at maximizing the (log) likelihood of a batch of observations. It tends to be preferred to its 
alternatives due to its robustness and ease of application in various scenarios, especially in cases 
where the model parameters are constrained. 

This contribution is devoted to online parameter estimation for HMMs, in which the available 
observations are only scanned once and never stored, allowing for a continuous adaptation of the 
parameters along a potentially infinite data stream. In the case of HMMs, online parameter esti- 
mation is a challenging task due to the non-trivial dependence structure between the observations. 
The EM-inspired methods proposed so far have been either based on finite-me mory approximations 
of the required smoothing computations (Krishnamurthy an d Moore . 19931 ) or on finite- memory 



approximations of the d ata log- likelihood itself (Ry den, 1993)^ An alternative consists in using 



gradient-based methods ( Le Gland and Mevel 19971 ) whic h do not directly follow the principles of 



the EM algorithm. Recently, Mongillo and Deneve ( 20081 ) proposed an online version of the EM 



algorithm for HMMs in the case where both the states and observations take a finite number of 
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values. The key ingredient of this algorithm is a recursion which allows for data recursive compu- 
tation of smoothing functionals required by the EM algorithm. However, this recursion appears 
to be very specific and i t s pote ntial application to more general types of HMMs is not considered 



m 



Mongillo and Denevel ^2Q04 ). 



The purpose of this pape r is to build on the idea of Mongillo and Deneve ( 20081 ) in light of 
the framework introduced in ICappe and MoulinesI (120091^ for online EM es timation in the case 
of independent observations. The framework of Cappe and Moulines ( 20091 ) is first extended to 
the case of HMMs by exhibiting a limiting, or population-based, EM algorithm, corresponding 
to the case of infinitely many observations. The existence of the limiting EM algorithm does 
provide fruitful insights on the behavior of EM for ergodic HMMs when the number of obser- 
va tions gets large. However, and in contrast to the case of independent observations considered 
CaPDe and MoulinesI (|2009l ^. approximating the limiting EM algorithm with an online sample- 



in 



based stochastic approximation algorithm turns out to be a diffic ult task for HMMs. The sec ond 
contribution of the paper consists in recognizing the recursion of Mongillo and Deneve ( 20081 ) as 
an instance of the the recursiv e smoothing schemes for sum funct i onals described, among others, 
bv lZeitouni and Dembol (|l988l l. lElliott et al.1 dlQDSl ). ICappe etaP toO^ ). This observation makes 
it possible to propose a generic online EM framework for HMMs with finite state-space. 

The paper opens with a brief review of online EM in the case of independent observations. The 
main results, that is, the existence of a limiting EM recursion in the case o f HMMs (Theorem [H) 
and th e online procedure (Algorithm [T|) which generalizes the algorithm of ^ Mongillo and Deneve 
( 20081 ) are exposed in Section O Finally, Section H] is about the application of the proposed 
procedure in the specific example of a Markov chain observed in Gaussian white noise. 



2 Online EM in the Independent Case 

2.1 Fisher Relation and the Limiting EM Algorithm 

Consider the case of an i.i.d. (independently and identically distributed) missing data model, 
where {Yt)t£z denote the observation sequence and {Xt)t£Z are the associated latent (or unob- 
servable) variables, hereafter referred to as states. The joint probability density function (pdf) 
of Xt and Ij is denoted by pe{xt,yt) and teiyt) is the marginal pdf, or likelihood., associated to 
the observation, where 6 denotes the model parameter. In the following, it is assumed that the 
observation sequence is distributed under an actual unknown parameter value 6^, (a l thoug h the 
case where this assumption is relaxed has also been analyzed in Cappe and Moulined . 20091 ) . 



Under suitable regularity assumptions, it is well known that the normalized maximum-likelihood 
criterion }^Yl^=i^°^^d{Yt) tends, Pg^ almost surely, to the limiting contrast — K(£e^||£e), where 
K{qi\\q2) = f ^og^{y)qi{y)dy denotes the Kullback-Leibler divergence between qi and q2. Sim- 
ilarly, the intermediate quantity of EM, ^^Ylt=i^s[^'^SPe'{Xt,yt)\Yt] tends to the deterministic 
limit Fig_^\Eg{logpg' {Xt,Yt)\Yt)]. The limiting EM algorithm thus consists of 

E-step compute Eg^ [Eg^ {\ogpg{Xo, Yo)\ Yq)] ; 

M-sep set Ok+i = argmaxEg^ [E^^ {logpg{Xo,Yo)\ Yq)] . (1) 

It is straightforward to show that, as in the usual EM algorithm, each iteration of the limiting 
EM algorithm decreases the target criterion, that is K{£gJ\ig^_^^) < K{ig^\\£g^). Convergence of 
the limiting EM recursion to the set the stationary points of t he limiting contrast K (ig^\\ig) can 
be proved using the so-called Fisher identity (see discussion of Dempster et al. . 19771 ): 



Vglogig(Yo) = Eg [V glogpgiXo,Yo)\Yo] , (2) 
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where V denotes the gradient operator. The Fisher identity imphes that Eg^ [Eg ( Vg logpg{XQ, YqI ^o)] 
and Eg_^ [Vg log^5i(yo)] coincide and, hence, that stationary points of the hmiting EM mapping are 
also such that VKUg. \\£g) = 0. 



2.2 Exponential Families and the Sufficient Statistics Reparameterization 

In general, the algorithm in ([1]) is not very explicit and is mostly useful in case where the joint 
pdf pg belongs to an exponential family: 

pg{x, y) = h{x, y) exp ((V'(<9), s(x, y)) - A{e)) , 

where (•) denotes the scalar product, s{x,y) are the (complete-data) sufficient statistics and A{6) 
is the log-partition function. Furthermore assume that the equation {'Vgip{6),s) — 'VgA{9) = 
has a unique solution for all achievable values of s, which is denoted hy = 9{s) (for a canonical 
exponential family, where 'i/'(^) = ^! this requirement is equivalent to assuming that the Fisher 
information matrix is positive definite for all values of 9). Then, the limiting EM algorithm in ((!]) 
may be equivalently rewritten as 



HEg^ [EgJsiXo,Yo)\Yo)]) . 



(3) 



The algorithm may also be equivalently written in terms of the sequence of associated sufficient 
statistics which are such that Of^ = 0(8^). Under this reparameterization, the limiting EM algo- 
rithm obeys the simple recursion 



k+l 



Eg 



E 



g^s,){s{Xo,Yo)\Yo) 



(4) 



where the stationary points of K{ig^\\ig) are now more explicitly identified as the roots of the 
equation S = Eg^ Eg(5) {s{Xo,Yo)\Yo) 

2.3 Additive Decomposition and the Stochastic Approximation Algorithm 

Of course, the limiting EM algorithm discussed above is not a parameter estimation procedure 
as it requires the knowledge of To obtain a practical estimation algorithm, one simply needs 
to observe that Eg^ [Eg {s{Xq,Yo)\Yo)] may be estimated consistently from the observations as 
^Ylt=i'^^ i^(-^t,Yt)\Yt]. The online algorithm is then obtained by using the usual stochastic 
approximation (or Robbins-Monro) procedure 



n+l 



7n+lEg(^^) [s(X„+i, y„+l)| Yn+l] + (1 - 7n+l)5'„ 



(5) 



where (7n.) is a decreas i ng sequence of step-s izes. The principle of SEjj has bee n first exposed by 
Neal and HintonI dlQQfll'l.lSato and Ishiil tood) and latter extended by lSatd ICappe and Moulinei 

(|2009l ). ICappe and Moulined (|2009l f analyzed the recursion in ^ to show that, under suitable 
assumptions: (i) it is indeed consistent, converging to the stationary points of K{ig^\\ig); (ii) by 
properly choosing the rate of decrease of the step-sizes 7^ and using Polyak-Ruppert averaging (see 
also Section below) . 9n+i = 9{Sn+i) is an asymptotically efficient estimator of Compared 
to gradient algorithms, it is quite remarkable that the algorithm in Q can achieve asymptotic 
efficiency without trying to explicitly estimate the Fisher information matrix. Furthermore, it can 
be observed that the choice of performing the stochastic approximation in the domain of the suffi- 
cient statistics rather than in the parameter domain, that is, using ^ rather than ([3]) is also most 
natural given that only (j4j) may be directly estimated by a running average of properly selected 
functions of the observations. 
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3 Online EM for HMMs 



I now discuss the generalization of the ideas presented above to the case of Hidden Markov mod- 
els. Quite surprisingly, there are, under suitable mixing assumptions, direct analogs of the ideas 
presented in Sections 12.11 and 12.21 The tricky part consists in finding a suitable replacement for 
the stochastic approximation procedure of Section 12.31 which does not apply exactly for HMMs 
due to the time dependence. 

In this section, it is assumed that the state and observation sequences, {Xt,Yt)t^i are generated 
under a stationary Hidden Markov model with parameter 6^,; £g^, pg^, Fg^ and Eg^ refer to, 
respectively, the likelihood, the joint density of the states and observations, the probability, and 
,the expectation under this model. In practice, one observes the observation sequence {Yt)t>o 
starting from time only and the postulated initial distribution u will be be arbitrary; i^^g, p^^g, 
P,y^g and E,^ refer to the same quantity as previously but computed under this second model. 
Note that z/ is not considered as a model paramet er as it cannot be e stimated consistently from a 



single trajectory (see also Chapters 10 and 12 of ICappe et al.l . l2005l on this point). Finally, it is 



assumed that the state variable takes its values in the finite set X and the state transition matrix 
and state conditional pdf that characterize the HMM are denoted, respectively, by qg{x,x') and 
9e{x,y). 

3.1 The Limiting EM Algorithm 

Under suitable assumptions (see below), the normalized HMM log-likelihood ^ log£i,^g(YQ, . . . ,Yn) 
converges, Pg^ almost surely and in L^, to the limiting contrast 



cgM = ^eA^ogeg(Yo\Y- 



oo: — 1 



)]• 



(6) 



The same is true for the normalized score ^\/glogii,^g{YQ, . . . ,Yn) which converges to Vcg^{9). 
Such consistency result s have been e s tablis h ed, under v a rious assumptions, by (among others) 
Baum and Petriel (jlQed ). iBickel et al.l (|l998l ). [Pouc et al.l (j2004l ). Now, thanks to Fisher identity, 
for all n, 



n 



Vglogl^^Yo,. 



,Yn) 



-E,. 



n 



^VglogpgiXt,Yt\Xt-i: 



t=l 



Yr 



0:n 



+ -E,,e [Vglogp,,g{Xo,Yo)\ Yq-.u] ■ (7) 



n 



For simplicity, the last term on the right-hand, whose influence is clearly vanishing with increasing 
values of n, will not be considered in the following. Hence, the consistency result for the score 
function combined with ([7]) implies that ^E^^^g ['^^^-^^\7glogpg{Xt,Yt\Xt-i)\Yo:n] also converges 
Fg^ almost surely to Vgcg^{9), the gradient of the limiting contrast. 

To obtain an alternative representation of this limit, assume that both qg and gg belongs to 
exponential families such that 



{x, x') = h'^ix, x') exp {{tpi{e),s'i{x, x')) - Ai{e)) , 
(x, y) = h^ix, y) exp ((V^(^), s^(x, y)) - A<^{e)) . 



(8) 



Note that under our assumption that X is finite, the first requirement is always satisfied. 
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For the sake of conciseness, I will adopt in the rest of Section [3] a condensed representation 
-which is also slightly more general than the HMM case- by assuming that the joint state and 
observation conditional density pg{xt,yt\xt^i) belongs to an exponential family such that 



Peixt,yt\xt^i) = h{xt,yt) eyiYi {{tpiO), s{xt-i,xt,yt)) - A{e)) 

where h{x,y) = hi{x,x')h9{x,y), A{9) = A'i{e) + {6) and 

f I ^_fsi{x,x') 
s9{x',y) 



(9) 



^(6') = ( JgJ^J ) , s{x,x',y) 



In this case, the non-vanishing term in the r.h.s. of ^ may rewritten as 
1. 



n 



t-i, 



t=i 



Yo:r. 



1. 



n 



^siXt-i,Xt,Yt] 



t=i 



Yc 



0:n 



VgA{9). (10) 



The following theorem defines the limiting behavior of the r.h.s. of the above equation, and thus, 
the limiting EM algorithm for HMMs (see Appendix |A] for the corresponding proof). 

Theorem 1. Assume that (i) is a finite set; (ii) the transition matrix is such that qQ{x,x') > 
e > for all £ 9; (iii) supgsupyge{y) < oo and E^^ [jloginf^ ^e(io)|] < oo, where gg{y) = 

9di^^y)> (^^) parameter space Q is compact and 9^, £ interior(O); (v) ipg, Agj'ipg, Ag 
in ([8]) are continuously differentiable functions on inter ior(0); and, (vi) the equation {'Vq'iP{9),s) — 
'VdA{9) = has a unique solution, which is denoted by 9 = 9{s). 

Then, 



1e. 

n 



Y,siXt-i,Xt,Yt: 



t=l 



Yr 



0:n 



Ee, {Ee[s{X^uXo,Yo)\Y^ 

oo:oo 

]) , Fg^ a.s. 



(11) 



and the stationary points of the limiting EM algorithm 

9k+i = 9{Eg^ (EeJs(X_i,Xo,yo)| 5^-00:00])} 
are the stationary points of the limiting likelihood contrast cg^{9). 

3.2 Online EM 

Theorem [1] suggests a principle similar to the case of the i.i.d. mixture model of Section [2l 
To obtain an online algorithm however, one needs to be able to estimate consistently the limit 
Eg^ {Eg[s{X_i,Xo,Yo)\Y_oo:oo])- The normalized sum ^E„^g s{Xt-i, Xt,Yt)\Yo:n] is not 

directly a candidate as it is well known that it cannot be computed rec ursively when incorporating 
new observations. However, following the idea originally proposed by ( Zeitouni and Dembol . 19881 . 
Elliott et al.l . ll995l l. define 



.,u,9ix) = Fu,e iXn = x\ Yo.r. 



Pn.u,e{x) = — Ej,0 

n 



Y,s{Xt-i,Xt,Yt: 



t=i 



Xn X 



(12) 
(13) 
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which are such that ^Y^^^zx 4'n,,y,e{x)pn,u,e{x) = ^u,e[Y^t=i s{Xt-.i, Xt,Yt)\Yo:n]. The appeal of 
this new decomposition being that <f>n u o 

and Pn,u,e can be updated recursively as shown by the 

following Proposition. 

Proposition 1. (j)n.u,e o,nd Pn.ufi may he computed according to the recursion 
Initialization Compute, for x ^ X , 

Po,u,e{x) = 

Recursion For n >0, compute, for x £ X, 

_ Ylx'&x 4>n,,y,e{x')qe{x' , x)ge{x,Yn+i) 



Pn+i,u,e(.x) = Y] I r^(^''^'^" 
^-^ I n + 1 



(14) 



,(-, f /^l (t)n,ufi{x')qe{x',x) 

+ V ^TTTJ ^"■^•'^"^/E."e;.'^n,.,.(x")'Z.(x",x) ^'^^ 

In Proposition [1] above, the rightmost term in (jlSp corresponds to the backward retrospec- 
tive probability P,y^e(X„ = x'\Xn+i = Xjlom), which does not depend on the newly available 
observation Yn+i- The main argument in proving Proposition [1] is to check that 

^u,e{Xt = Xt,Xt+l = Xt+l\Xn-i-l = Xn+i, Yoin+l) = 

Pu,e{Xt = Xt,Xt+l = Xt+l\Xn = Xn,YQ-n)P,^^e{Xn = = Xn+l,^0:n) 

for all indices < t < n — 1 which implies the claimed result by summation (see Chapter 4 of 
Cappe et all bOQ^ for a complete proof). 



Proposition [H constitutes a recursive rewriting of the computation required to carry out the 
E-step in the batch EM algorithm. By analogy with the case of independent observations, the 
proposed online EM algorithm for HMMs takes the following form. 

Algorithm 1. Chose a decreasing sequence (7n)n>i of step- sizes, which satisfy the usual stochas- 
tic approximation requirement that X]„>x7n = oo and X^„>]^7^ < oo. Also select a parameter 
initialization Oq and a minimal number of observations nmm required before performing the first 
parameter update. 



Initialization Compute, for x £ X , 

Mx] 



'^ix)gg^lx,Yo) 



J2x'ex'^i^')9§JyX',Yo) ' 
poix) = . 
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Recursion For n > 0, 

Compute, for x G X, 



4>n+l{x) 



Pn+l{x) = ^ {-fn+is{x' , X,Yn+l) + {I - ^n+l)Pn{x')} 



x'eX 



(i>n{x')qgj^x',x) 
Ex"<^xMx")qgJ.^",x) 



(16) 

(17) 



If n > rijnin, update the parameter according to 

9n+l = ^ ( XI Pn+l{x)4>n+l{x) , 

\xex J 



otherwise, set 9n+i = On- 



3.3 Discussion 



Mongillo and Deneve ( 20081 ) considered, the particular case of finite valued HMMs where the ob- 
servations (lt)t>i also take their values in a finite set y. I n such a situation, it is easil y checked 
that whatever the chosen parameterization of the model ( Mongillo and Deneve ( 20081 ) consider 
only parameterization by the sets of conditional probabilities), the complete-data sufficient statis- 
tics can be chosen to be s(Xt--\ , Xt, Yj) = {l{Xt-i = i,Xt = j, Yt = ^})(ij,/c)GA'2xy- The recursion 
derived by Mongillo and Deneve ( 20081 ) for this case is based on recursively updating the product 
Tn,u,e{x) = 4>n,u,dix)Pn,i/,eix) rather than Pn,u,e{x). The probabilistic interpretation of the new 
term T„,^,e(x) is E^^g[{Y,t=i s{Xt-i, Xt,Yt)) l{Xn = x}\Yo.,n]. By multiplying ([IT]) by 0„+i(x) 
and using (fT6j) and , one obtains the following online update 



fn+l{x) =7n+l X s{x',X,Yn+l] 



+ (1 - 7n+i) X r„(x' 



(i>n{x')qgj,x' , x)g^J^x', Yn+l) 
Ex',x"€X2 Mx')q§^{x', x")%(x", Yn+l) 

qn(x',x)gA(x',Yn+i) 



x'eX 



Ex',x"ex^ Mx')qeJ.x', x")%(x", Yn+l 



(18) 



which coincides with Eqs. (15)-(16) of Mongillo and Denevel ( 20081 ). for the particular choice of 
complete-data sufficient statistics discussed above. 

Of course, using either (fT7|) or (fTSj) is practically equivalent. The form of (fT7|) is preferable 
from a conceptual point of view as it clearly shows that the new observation Yn+i only plays 
a role in the filter update ()16p . The limiting behavior of pn{x) is also expected to be simpler: 
From corollary [1] and proceeding as in the proof of Theorem [H it is easily shown that, for any 
X e X, Pn,u,9{x) indeed converges Pg^ almost surely to the same fixed limit as that exhibited 
in Theorem [H that is, Eg^ (Eg [s(X_i, Xq, lo)| ^-oo:oo])- Hence, it is conjectured that pn{x) will 
tends, as n increases, to a limit that is in dependent of x. 

Regarding the choice of the step-size, iMongillo and Deneve (I2OO8I ) consider the cases where, 
either, the step-size 7^ is small but non-decreasing, which may be useful for tracking potential 
changes but is not sufficien t to guarantee th e cons istency of the approach, or when 7^ = 1/n, 
which is selected, following Neal and Hinton ( 19991 ). by analogy with the batch EM algorithm 
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and its recursive rewriting in Proposition [TJ In general however, the choice 7„ = 1/n is not very 
recommendable for a stochastic approximation procedure and step-sizes of the form 7„ = 1/rC 
wi th 7 in the range 05-0.6 com bined with Polyak-Ruppert averaging are preferable (see discussion 



m 



Cappe and Moulines . 20091 ) . This point is illustrated in the numerical simulations of Section H] 



below. The role of nm i n is o nly to guarantee that the M-step update is numerically well behaved 
(jCappe and MoulinesI liooil ) and for this purpose, a small value of n-min is usually sufficient (for 
instance, n^i^ = 20 is used in the simulations of Section H^j) . 

Regarding the nu r nerica l complexity of Algorithm [H observe that in the case considered by 
Mongillo and Denevj (|2008l ) where s{Xt-^i, Xt,Yt) = {l{Xt-i = i,Xt = j,Yt = A;})(jj,fc)gA'2xy, 
s{Xi,-i, Xt,Yt) is a vector of dimension [A'p x |3^| (where | • | denotes the cardinal of the set). 
Thus, the numerical complexity of (1180 is of order lA"!^ x |3^| per observation. For this case, 



it is indeed possible to bring down the numerical complexity to the order of \X\ + \?(!\ 



\y\ 



operations by updating separately the terms corresponding to the two statistics {l{Xt_i = i,Xt = 
i}){j,j)GA'2 and {l{Xt = j,Yt = fc})(j,fc)eA'xy (see the example considered in the next section for 
more details). Interestingly, the numerical complexity of the bat ch EM algorith m for this model, 
when implemented using traditional forward-backward smoothing Rabiner ( 19891 ). is of the order of 
+ \X\ X |3^|) per observation and per iteration of the EM algorithm. Although, the comparison 
is not directly meaningful as the batch EM algorithm does necessitate several iterations to converge 
(see numerical illustrations in Section 14.30 , it is true that the scaling of the numerical complexity 
of the online-EM algorithm with \X\ may constitute an hindrance in models with a large number 
of states. This being said, the complexity of online gradient-based approaches, is equivalent as the 
main burden comes from the necessity of updating, via a recur sion related to (1171), q r ie coo rdinate 



of the gradient for each of the couples {x,x') G (see, e.g., Le Gland and Mevel . 199?! ). Note 
that if the transition matrix is structured — i.e., parametered by a low dimensional parameter 
rather than by all its individual entries — , the numerical cost associated to the implementation of 
the approach will be reduced to an order of the number of parameters times \X\'^. 



4 Application to Gaussian HMMs 



4.1 HMM with Product Parameterization 

For the sake of concreteness, I consider in following the case where the state variables {Xt}t>Q 
take their values in the set {1, . . . ,m}. In addition, assume that, as is often the case in practise, 
the parameter 9 may be split into two sub-components that correspond, respectively, to the state 
transition ma trix go and to the state - condi tional densities {ge{i, •)}i<i<»n- In the fully discrete case 
considered in Mongillo and Deneve ( 20081 ) for instance, the parameter 9 consist of the transition 
matrices qq and gg parametered by their respective entries, with the constraint that each line of 
a transition matrix must sum to one. In the case of Gaussian HMMs used in speech processing 
as well as many in other applications, the parameters are the state transition matrix qg and 
the mean vector and coyarianc e matrix associated with each of the m state-conditional densities 
{ge(.i,-)}i<i<m (|Rabineil .E989l). 

In such a model, there are two distinct types of EM complete data sufficient statistics which 
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give rise to two separate forms of the auxiliary function pn,u,9- 

1 

.t=i 



1 

n 



t=l 

J2HXt = i}s{Yt) 



t=o 



k 



(19) 
(20) 



where the form of s itself depend on the nature of the state-conditional distribution gg(x,-) — see 
Gaussian example below. There's a slight difference between (|20p and (jl3p . which is that (j20p also 
incorporates the initial (t = 0) conditional likelihood term, i.e., the contribution corresponding to 
the rightmost term on the r.h.s. of ([7]). As noted earlier, this difference is minor and does not 
modify the long-term behavior of the algorithm. 

With these notations, Eq. (|17p in Algorithm [1] is implemented as 



pl+i{i,j,k) = 7n+i(J(j - k)fn+i{i\j) + (1 - 7n+i) pl{i,j,k')fn+iik'\k) , 

k'=l 

m 

pi+iih k) = jn+iS{i - k)s{Yn+i) + (1 - 7n+l) ^ p?i{i, k')fn+i{k'\k) , 



(21) 
(22) 



fc'=i 



where 5 denotes the Kronecker delta (i.e., S{i) = iff z = 0) and the notation rn^i{i\j) refers to 
the approximate retrospective conditional probability : 



rn+iii\j) 



Em 
i'=l 



(23) 



A complete iteration of the online algorithm involves the approximate filter update (I16p and 
the stochastic approximation statistics updates (|2ip and (j22p followed by an application of the M- 
step function 9 to S^^iiiJ) = T.T=iPl+ii^^j'k)4>n+i{k) and S'^+i(i) = Y.'k=i Pn+i{hk)4>n+i{k). 
The form of the M-step depends on the exact nature of qg and gg. If the transition matrix qg is 
parametered simply by its entries, the update is generic and is given by 



Snihj) 



(24) 



For the update of the state-dependent parameters, one needs to be more specific and the form of 
the equations depend on the choice of the state conditional density gg{x,-). In the multivariate 
Gaussian case, the function s has to be chosen such that s{y) consists of the three components 
{l,y,yy*}. The corresponding components of the approximated EM extended statistics are de- 
noted, respectively, by 5^ q, 5^ 2- If state conditional Gaussian densities are parametered 
by their mean vectors, pg{i), and covariances matrices, Tig{i), the M-step update is defined as 



'S'n.o(^) 



Q I qy qy qy 



(25) 



(26) 



The derivation of ()24p and (|25p - (l26p is straightforward but some more details are provided in the 
next section for a particular case of Gaussian HMM. 
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4.2 Markov Chain Observed in Gaussian Noise 



In the numerical experiments described below, I consider the simple scalar model 
Yt = Xt + Vt, 

where (Vj) is a scalar additive Gaussian noise of variance v and {Xt) is a Markov chain with 
transition matrix q, which takes its values in the set . . . , /x(m)}. Although simple, this model 

is already statistically challenging and is of some irnporta nce in several app lications, in particular , 



as a b asic model for ion channels data (IChung et al 



port a n 
Il99nl ^ 



-see also, e. g.. (IRoberts and Ephraiml . 



20081 ) for discussion of the continuous time version of the model as well de Gunst et al. ( 2001 ) , for 



an up to date account of models for ion channels. The parameter 9 is comprised of the transition 
matrix q, the vector of means /x and the variance v. 

In this case, the intermediate quantity of the batch EM algorithm may be written, up to 
constants, as 



mm ^ m 

i=l j=l i=l 



(27) 



where 



1 

— Ei/6» 
n 




— Ei/6» 

n 


n 
_t=0 


— Ei/e 
n 


n 
. i=0 


n 


n 
.t=0 



t-1 



Maximization of (j27|) with respect to 

SUi) 



i,Xt=j} 



Yr 



0:n 



Yr 



0:n 



Yr 



0:n 



Yr 



0:n 



and V0 directly yields (p^ as well as 



„{i) = e{s%,sl. 



D I q9 q9 q9 
^ I '-'n,0' '-'n,!' '-'n.,2 



ET=i [SU^ - ^^H^)s%{^: 



(28) 
(29) 



It is easily checked that, as usual, the M-step equations (124^ and (I29p do satisfy the constraints that 
g be a stochastic matrix and v be non-negative. Note that for this particular model, the use of the 
statistic 5^ 2 could be avoided as it is only needed in the M-step under the form X^I^i S^ 2(^)1 which 
is equal to ^ Ylt=o ■ Algorithm [2] below recaps the complete online EM algorithm pertaining to 
this example. 



Algorithm 2 (Online EM algorithm for noisily observed m-state Markov chain). 
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Initialization Select 6q and compute, for all 1 < i,j, k,< m and < d < 2, 

Efe'=i%(^>^o) 
Pl{hj,k) = , 
plJi,k)=6{i-k)Yo^. 

Recursion For n > 0, and 1 < i,j,k,< m, < d < 2, 
Approx. Filter Update 

Efc^=i 4>n{k')qn{k' , k)ggj,k, Yn+i) 



Pn+l 



{k) 



where g§Jyk,y) = exp [-{y - /i„(A;))2/2£>„] . 
Stochastic Approximation E-step 

m 

Pn+i{i,j,k) = 7n+i(5(i - k)fn+i{i\j) + (1 - 7n+l) ^ pUh j , k')fn+i{k' \k) , 

k'=l 

m 

P'n+Uh k) = 7n+i<5(i - A;)y„'Vi + (1 - 7n+i) 5] P%{h k')fn+i{k'\k) , 



k'=l 



where r„+i(i|j) = 4>n{i)qn{hj)/YA^=i(i>n{i')qn{i',j)- 
M-step Ifn> nmin, 

m 

5^+1 (i,i) = X] Pn+li^d^k')(i>n+lik') , 



k'=l 



Qn+i{i,3) 



m 

= Pi+i4{hk')^n+i{k') , 



Pn+l{i) 



Vn+l 



k'=l 
'5n+l,l(0 



I^i'=l ( 'S'n+1,2(^') ~ An+l(^')5'^+l,o(*') 



4.3 Numerical Experiments 

Algorithm [2] is considered in the case of a two-state (m = 2) model estimated from trajectories 
simulated from the model with parameters 

g^(l,l) =0.95, M*(l) = 0, 
g.(2,2) =0.7,^.(2) = !, 
= 0.5 . 
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With these parameters, state identification is a difficult task as the separation of the means 
corresponding to the two state is only 1.4 times the noise standard deviation. The optimal filter 
associated with the actual parameter does for instance misclassify the state (using Bayes' rule) in 
about 10.3% of the cases. As will be seen below, this is reflected in slow convergence of the EM 
algorithm. 

All estimation algorithms are systematically started from the initial values 

g*(l,l)=0.7,/x.(l) = -0.5, 
g^(2,2) =0.5, ;u^(2) = 0.5, 

and run on 100 independent trajectories simulated from the model. 

1 

0.9 

cr 

0.7 



10 20 30 40 50 60 70 80 90 100 

0.05 


^ -0.05 
^ -0.1 
-0.15 
-0.2 

10 20 30 40 50 60 70 80 90 100 
EM iterations 

Figure 1: Estimated values of g(l, 1) (top) and ^[1) (bottom) as a function of the number of batch 
EM iterations for n = 500 (dotted lines) and n = 8000 (solid lines) observations. The plot is based 
on 100 independent realizations summarized by the median (bold central line) and the upper and 
lower quartiles (lighter lines). 

Figure [T] illustrates the consequences of Theorem [1] by plotting the estimates of the parameters 
(7(1, 1) and ^{1) obtained by the batch EM algorithm, as a function of the number of EM iterations, 
for two different sample sizes: n = 500 (dotted lines) and n = 8000 iterations. To give an idea 
of the variability of the estimates. Figure [T] feature the median estimate (bold line) as well as the 
lower and upper quartiles (lighter curves) for both sample sizes. The first striking observation is 
the slow convergence of EM in this case, which requires about 50 iterations or so to reach decent 
estimates of the parameters. When comparing the curves corresponding to the two samples sizes, 
it is also obvious that while the variability is greatly reduced for n = 8000 compared to n = 500, 
the median learning curve is very similar in both cases. Furthermore, the plots corresponding to 
n = 8000 provide a very clear picture of the deterministic limiting EM trajectory, whose existence 
is guaranteed by Theorem [H 

Indeed, the large sample behavior of the batch EM algorithm is rather disappointing as using 
a fixed number of iteration of EM does involve a computational cost that grows proportionally 
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Figure 2: Estimation results when using 50 batch EM iterations. From left to right, estimated 
values of q{l,l), ;u(l) and v for values of n ranging from 0.5 to 128 thousands of samples. Box 
and whiskers plot based on 100 independent realizations. 
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Figure 3: Estimation results when using the online EM algorithm with 7„ = n~^'^. From left to 
right, estimated values of q{l, 1), /x(l) and v for values of n ranging from 0.5 to 128 thousands of 
samples. Box and whiskers plot based on 100 independent realizations. 
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to n but will converge, as n grows, to a deterministic limit which only depends on the parameter 
initialization. This is all the more regrettable that from a statistical perspective, it is known that 
the true maximum likelihood estimator does converge, as rate n~^/^ towards the actual value 
6-k of the parameter. This behavior of the batch EM algorithm is illustrated on Figure [2] which 
displays, from left to right, the estimation results for the parameters associated with the first 
component (q{l,l), ^(1)), as in Figured! together with the noise variance v (rightmost box). 
The 100 realizations are here summarized as box and whiskers plots. Figure [21 which should be 
compared with Figure [3] below, shows that when using a potentially large (here, 50) but fixed 
number of iterations the variability of the batch EM estimates does decrease but the accuracy 
does not improve as n grows. Clearly, statistical consistency could only be achieved by using more 
batch EM iterations as the number n of observations grows. 

In contrast. Figure [3] which corresponds to the online EM algorithm outlined above as Al- 
gorithm [2] used with 7„ = n""'^ does suggest that online EM estimation is consistent. For the 
smallest sample sizes {n = 500 or n = 2000), the estimation variance still is quite large and the 
online estimates are not as good as those obtained using 50 batch EM iterations. But for sam- 
ple sizes of n = 8000 and larger, the online EM estimates are preferable despite their somewhat 
larger variance. In this application, the c hoice of a slowly decreasing step-size — 7„ = n~^'^ was 
chosen following the recommendations of ICappe and Moulines (|2009l i— appears to be of utmost 
importance. In particular, th e choice = des pite its strong analogy with the batch EM 
case (see Section [3] as well as ( Neal and Hinton . 19991 )) provides estimates that converge much to 
slowly to be usable in any practical application. This observation is certainly a consequence of 
the temporal dependence between the observations and, correlatively, of the time taken by the 
filtering and smoothing relations to forget their initial state (as in the example under consider- 
ation the assumptions of Theorem [1] as satisfied for in a neighborhood of 6^, with a constant 
e = g+(l, 2) = 0.05, which is rather small). 



Method 


MATLAB 7.7 


OCTAVE 3.0 


Online EM 


1.57 


5.66 


Batch EM (one iteration, recursive) 


1.24 


3.98 


Batch EM (one iteration, forward-backward) 


0.31 


2.94 



Table 1: Computing times in seconds for a record of length n 
processor). 



10000 observations (2.4 GHz 



Note that the comparison between Figure [2] and Figure [3] is not meant to be fair in terms of 
computing time, as shown by Table The 50 batch EM iterations used to produce Figure [2] 
take about 10 to 40, depending on the implementation of batch EM, times longer than for the 
corresponding online estimates of Figure [3l Being fair in this respect would have mean using 
just five batch EM iterations which, as can be guessed from Figure [H is not competitive with 
online EM, even for the smallest sample sizes. Note that a different option would have been to 
also consider run i iing t he online EM algorithm several times on the same batch data, following 
Neal and HintonI (|l999l ). In the case of hidden Markov models however, this way of using the 
online EM algorithm for fixed sample maximum likelihood estimation of the parameters appears 
to be less straightforward than in the case of i.i.d. data and has not been considered. 

The two batch EM implementations featured in Table [1] correspond, respectively, to the use of 



^The MATLAB/OCTAVE code used for the simulations is very simple and will be made available from the web. 
It is mostly vectorized, except for a loop through the observations, and hence it is expected that the differences in 
running times are indeed representative, despite the use of an interpreted programming language. 
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the recursive form of smoothing based on Proposition [T] and to the usual forward-backward form 
of smoothing. The former implementation is obviously related to the online EM algorithm, which 
explains that both of them lead to rather similar running times. As discussed in Section [3. 3| due 
to the fact that the whole m x m transition matrix q is here used as a parameter, the numerical 
complexity of the online EM algorithm and of the recursive implementation of batch EM scale as 
m^, compared to only for the batch EM algorithm when implemented with forward-backward 
smoothing. Hence, it is to be expected that the forward-backward implementation of batch EM 
would be even more advisable for models with more than m = 2 states. On the other hand, when 
m is large it is usually not reasonable to parameterize the transition matrix q by its individual 
entries. 

In order to provide a more detailed idea of the asymptotic performances of the algorithm. 
Figure H] displays results similar to those of Figure [3] but centered and scaled as follows. Each 
parameter estimates, say 9n is represented as \/n{6n — 0^) and further scaled by the asymptotic 
standard deviation of 9 deduced from the inverse of the Fisher information matrix. The Fisher 
information matrix has been estimated numerically by applying Fisher's identity to (|27|) so as to 
obtain 

-Vg(ij) log^e yO:n) = — 7— X T- T 1 < J < m , 

-V^(i) log4(yo:n) = ■ , 



iv,log^,(yO:n) = (S^,2(i;^) -2/U(i)5^,l(z) +m'W51o( 

1=1 



The information matrix has then been obtained by averaging the gradient computed in 9-^ for 100 
independent sequences of length one million simulated under 9-^. 

Add i tional l y, Figure Bl also displays results that have been post-processed using averaging 



(jPolvakl . Il99d . iRupperti 11988 ). In Figure HI Polyak-Ruppert averaging is used starting from 
'^avg = 8000. That is, for n > 8000, 0„ is replaced by l/(n — 8000) X]"=8ooi ^n- For time indices n 
smaller than 8000, averaging is not performed and the estimates are thus as in Figure [3l except 
for the centering and the scaling. Under relatively mild assumptions, averaging has been shown 
to improve the asymptotic rate of convergence of stochastic approxima t ion a lgorithm making it 



possible to recover the optimal rate of l/\/n (see Cappe and Moulinesl . 20091 . for an illustration 



in the case of the online EM for independent observations). At least for //(I) and v, Figured] 
suggest that in this example the proposed algorithm does reach asymptotic efficiency, i.e., becomes 
asymptotically equivalent to the maximum likelihood estimator. For g(l, 1) the picture is less clear 
as the recentered and scaled estimates present a negative bias which disappear quite slowly. This 
effect is however typical of the practical trade-off involved in the choice of the index navg where 
averaging is started. To allow for a significant variance reduction, navg should not be too large. On 
the other hand, if averaging is started too early, forgetting of the initial guess of the parameters 
occurs quite slowly. In the present case, the negative bias visible on the left panel of Figure[3]is due 
to riavg being too small (see corresponding panel in Figure [3]). Although, this could be corrected 
here by setting riavg to twenty thousands or more, it is important to underline that optimally 
setting navg is usually not feasible in practice. 
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5r-. • • 1 ^ 8^ • • • 5^ 1 • • r 




-10- 1 ^ 



0.5 2 8 32 128 0.5 2 8 32 128 0.5 2 8 32 128 

samples X 1000 samples x 1000 samples x 1000 

Figure 4: Estimation results when using the onhne EM algorithm with 7„ = n~^'^ with Polyak- 
Ruppert started after n = 8000. From left to right, estimated values of q{l,l), and v for 
values of n ranging from 0.5 to 128 thousands of samples. The estimated values are centered and 
scaled so as to be comparable with a unitary asymptotic standard deviation. Box and whiskers 
plot based on 100 independent realizations. 



5 Conclusions 

The algorithm proposed in this paper for online estimation of HMM parameters is based on 
two ideas. The first, which is inspired by Sato ( 200d ). Cappe and Moulined ^M) consists in 



reparameterzing the model in the space of sufficient statistics and approximating the limiting 
EM recursion using a stochastic approximation procedure. Theorem [1] provides a first argument 
demonstrating that this idea is also fruitful in the case of HMMs. The second element is more 
specific to HMMs and relies on the recursive implementation of smoothing computations for sum 
functionals of the hidden state which is used in Algorithm [TJ As discussed in Section [3l this 
possibility requires thctt tlie auxiliciry quantity Pn,v,6 

defined in (fT3]) be approximated during the 

course of the algorithm. 

Although the performance reported in Section H] is encouraging, there are several questions 
raised by this approach. The first is of course the theoretical analysis of the convergence of 
Algorithm [U which is still missing. Although originally inspired by stochastic approximation 
ideas, it seems that Algorithm [1] would be difficult to analyze using currently available stochastic 
approximation results due to the kernel convolution involved in (|17p . As discussed in Sections 13.31 
and 14.31 the proposed algorithm may become less attractive, from a computational point of view, 
when used in models with many distinct state values. For such cases, it would be of interest to 
consider specific versions of the algorithm, using some form of approximation, perhaps based on 
Monte Carlo simulations. 
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A Proof of Theorem [T] 



Theorem [T] mainly relies on the use of a two-sided forgetting result which is first proved as C orol- 
lary [1] below. This result generalizes the one-sided forgetting bounds of Douc et al. ( 20041 ) and 
allow conditioning with respect to both past and future observations, which is required for study- 
ing the asymptotic be havior of (171) and related quantities. The proof of Theorem [1] then mostly 
relies on the results of Douc et al. 



Lemma 1. Given q a transition matrix on the finite set X such that q{x,x') > e > and a and 
P probabilities on X , define 

a{x)q{x, x')j3{x') 



J, 



'x,x') 



Then 



\J, 



J, 



02,g,/32 111 



< 



\ai - a2\\i + WPi - /32II1) 



(30) 



where ||/i||i = denotes the or total variation norm. 

Proof. Lemm a [1] is obviously related to the application of Bayes' formula. Hence, one may apply 
Lemma 3.6 of iKiinschI ( 200ll ) to obtain \\Jai,q,i3i — Ja2,q,i32\\i ^ ill"^! Pi — 0.2 ® /32||i- The r.h.s. 
of ([30j) is obtained by noting that \ai{x)(3i{x') — a2{x)l32{x')\ < \ai{x) — a2{x)\(3i{x') + |/?i(x') — 
(32{x')\a2{x). □ 

Corollary 1. Under the assumption of Theorem{l\ for any function f such that < f < 
and probabilities /ii and fi2 on X^, and any index 1 < t < n, 



Eg [f{Xt-l,Xt)\ Yo.,n,Xo = x,Xn = x'] lJLl{x,x') 

x,x'£X^ 

- ^ ^e[f{Xt-uXt)\Yo:n,Xo = X,Xn = x]li2{x,x') 



< 



where p = (1 — e). 

Proof. First apply Lemma [T] to the familiar forward-backward decomposition 

ai{x) = Ve{Xt-i = x|yo:t-i,^o = x^^i) , 

I3i{x') OC Ve{Yt:n,Xn = Xn,i\Xt = x') , 

for i = 1,2 (where the normalization factor in the second equation is determined by the constraint 
J2x£xl^ii^) = 1) to obtain 



Ee [f{Xt-l,Xt)\ Yo:n,Xo = Xo,l,Xn = Xn,l\ 



Ee [f{Xt-l,Xt) \ Yo-n, Xq = Xo,2, Xn = Xn,2] 



< 



\ai - a2\\i + WPi - /32II1) , 



observing that Pg{Xt-i = x, Xt = a;^|Yh:^-i,^n = xo,i,Xn = Xn,i) = Jai,qg,i3i- Next use, the 
one-sided forgetting bounds of IDouc et al. ( 2004 ) (Corollary 1 and Eq. (20)) to obtain ||q!i — 



02\\i < P* ^ and — /92||i < p" *• The result of Corollary [T] follow by the general inequality 
1/^1(5) - f^2{g)\ < ^IIpi - P2II1 sup^^^^2e22 \g{zi) - g{z2)\. □ 
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Note the fact that the backward function (3i{x') = Pe(li:„,X„ 



\Xt = x') may be nor- 



mahzed to a pseudo-probabihty, has been used in the a bove proof. This is generahy not the case 
outside of the context where X is finite ( Briers et al. . 2004 . Cappe et al. . 20051 ) but it is easily 
checked that Cor ollary [J holds in gre ater generality under the "strong mixing conditions" discussed 
in Section 4.3 of lCappe et al.l (j2005l ). 



Proof of Theorem [ij Corollary [T] implies that 



for m > n, where m{y) is an upper bound for |s(X_i, Xq, lo)|i which may be chosen as m{y) = 
l + \s^{x, y)\ due to the fact that s''{x, x') is a vector of indicator functions when X is finite. As, 
9i, G interior (O), standard results on exponential family imply that miYo) admits finite moments 
of all orders under Pg^. Hence, the a.s. limit of Eg [s(X_i, Xq, lo)| ^-m:m] as m — > oo, which is 
denoted by Eg [s(X_i, Xq, Yb)! cxDioo]) exists and has finite expectation under Pe*. Similarly, 



n 

- V (E,,e [s{Xt-i,XuYt)\ yo:n] - Eg [s{Xt-uXt,Yt)\ Y-oo-.oo]) < — 
n ne 



t=i 



As Fi0^[m{YQ)P] < oo for all p, standard applications of Markov inequality and Borel-Cantelli 
Lemma imply that the r.h.s. of the above expression tends a.s. to zero. Hence, the quanti- 
ties ^E^^g E"=i s{Xt-i,Xt, Yt)\ lo:n] and ^ J2t=i [Y^t=i siX^i,Xo, Yo)\ yioo:oo] have the same 
limit, where the latter expression converges to Eg^ (Eg Xq, lo)| ^-oo:oo]) by the ergodic 

theorem. This proves the first assertion of Theorem [TJ 

For the second s tatement, one can check that the assumptions of Theorem [1] imply (Al)-(A3) 

of' 
of 



Douc et al.l (|2004) as well as a rm of (A6)-(A8jl. ence, proceeding as in proof of Theorem 3 
20041 ) . shows that ^ converge a.s. to the gradient Vgcg^{0) of the limiting contrast 
defined in ([6]). Eq. (jlOp combined with the previous result then shows that parameter values 9 for 
which Vgcg^ (9) vanishes are also such that {'S/gip{9),Eg^ (Eg [s{X_i, Xo,Yo)\Y_oo:qo])) — 'VgA{9) = 



0, that is, 9{Eg^ (Eg [s(X_i,Xo,yo)|>^-oo:oo])} 



□ 
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